Abstract

We describe the CoNLL-2003 shared task: 
language-independent named entity recog- 
nition. We give background information on 
the data sets (English and German) and 
the evaluation method, present a general 
overview of the systems that have taken 
part in the task and discuss their perfor- 
mance. 



1 Introduction 

Named entities are phrases that contain the names 
of persons, organizations and locations. Example: 

[ORG U.N. ] official [PER Ekeus ] heads for 
[LOC Baghdad ] . 

This sentence contains three named entities: Ekeus
is a person, U.N. is an organization and Baghdad is
a location. Named entity recognition is an impor- 
tant task of information extraction systems. There 
has been a lot of work on named entity recognition, 
especially for English (see Borthwick (1999) for an 
overview). The Message Understanding Conferences 
(MUC) have offered developers the opportunity to 
evaluate systems for English on the same data in a 
competition. They have also produced a scheme for 
entity annotation (Chinchor et al., 1999). More re- 
cently, there have been other system development 
competitions which dealt with different languages 
(IREX and CoNLL-2002). 

The shared task of CoNLL-2003 concerns 
language-independent named entity recognition. We 
will concentrate on four types of named entities: 
persons, locations, organizations and names of 
miscellaneous entities that do not belong to the pre- 
vious three groups. The shared task of CoNLL-2002 
dealt with named entity recognition for Spanish and 
Dutch (Tjong Kim Sang, 2002). The participants 



of the 2003 shared task have been offered training 
and test data for two other European languages: 
English and German. They have used the data 
for developing a named-entity recognition system 
that includes a machine learning component. The 
shared task organizers were especially interested in 
approaches that made use of resources other than 
the supplied training data, for example gazetteers 
and unannotated data. 

2 Data and Evaluation 

In this section we discuss the sources of the data 
that were used in this shared task, the preprocessing 
steps we have performed on the data, the format of 
the data and the method that was used for evaluating 
the participating systems. 

2.1 Data 

The CoNLL-2003 named entity data consists of eight 
files covering two languages: English and German 1 . 
For each of the languages there is a training file, a de- 
velopment file, a test file and a large file with unanno- 
tated data. The learning methods were trained with 
the training data. The development data could be 
used for tuning the parameters of the learning meth- 
ods. The challenge of this year's shared task was 
to incorporate the unannotated data in the learning 
process in one way or another. When the best pa- 
rameters were found, the method could be trained on 
the training data and tested on the test data. The 
results of the different learning methods on the test 
sets are compared in the evaluation of the shared 
task. The split between development data and test 
data was chosen to avoid systems being tuned to the 
test data. 

The English data was taken from the Reuters Cor- 
pus 2 . This corpus consists of Reuters news stories 



1 Data files (except the words) can be found on
http://lcg-www.uia.ac.be/conll2003/ner/

2 http://www.reuters.com/researchandstandards/



English data       Articles   Sentences    Tokens
Training set            946      14,987   203,621
Development set         216       3,466    51,362
Test set                231       3,684    46,435

German data        Articles   Sentences    Tokens
Training set            553      12,705   206,931
Development set         201       3,068    51,444
Test set                155       3,160    51,943

Table 1: Number of articles, sentences and tokens in
each data file.

English data          LOC    MISC     ORG     PER
Training set         7140    3438    6321    6600
Development set      1837     922    1341    1842
Test set             1668     702    1661    1617

German data           LOC    MISC     ORG     PER
Training set         4363    2288    2427    2773
Development set      1181    1010    1241    1401
Test set             1035     670     773    1195

between August 1996 and August 1997. For the 
training and development set, ten days' worth of data 
were taken from the files representing the end of Au- 
gust 1996. For the test set, the texts were from De- 
cember 1996. The preprocessed raw data covers the 
month of September 1996. 

The text for the German data was taken from the 
ECI Multilingual Text Corpus 3 . This corpus consists 
of texts in many languages. The portion of data that
was used for this task was extracted from the German
newspaper Frankfurter Rundschau. All three of
the training, development and test sets were taken
from articles written in one week at the end of Au- 
gust 1992. The raw data were taken from the months 
of September to December 1992. 

Table 1 contains an overview of the sizes of the 
data files. The unannotated data contain 17 million 
tokens (English) and 14 million tokens (German). 

2.2 Data preprocessing 

The participants were given access to the corpus af- 
ter some linguistic preprocessing had been done: for 
all data, a tokenizer, part-of-speech tagger, and a 
chunker were applied to the raw data. We created 
two basic language-specific tokenizers for this shared 
task. The English data was tagged and chunked by 
the memory-based MBT tagger (Daelemans et al., 
2002). The German data was lemmatized, tagged 
and chunked by the decision tree tagger Treetagger 
(Schmid, 1995). 

Named entity tagging of the English and German
training, development and test data was done by
hand at the University of Antwerp. MUC conventions
were mostly followed (Chinchor et al., 1999).
An extra named entity category called MISC was 
added to denote all names which are not already in 
the other categories. This includes adjectives, like 
Italian, and events, like 1000 Lakes Rally, making it 
a very diverse category. 



Table 2: Number of named entities per data file 



2.3 Data format 

All data files contain one word per line with empty 
lines representing sentence boundaries. At the end 
of each line there is a tag which states whether the 
current word is inside a named entity or not. The 
tag also encodes the type of named entity. Here is 
an example sentence: 



   U.N.       NNP   I-NP   I-ORG
   official   NN    I-NP   O
   Ekeus      NNP   I-NP   I-PER
   heads      VBZ   I-VP   O
   for        IN    I-PP   O
   Baghdad    NNP   I-NP   I-LOC
   .          .     O      O



Each line contains four fields: the word, its part- 
of-speech tag, its chunk tag and its named entity 
tag. Words tagged with O are outside of named en- 
tities and the I-XXX tag is used for words inside a 
named entity of type XXX. Whenever two entities of 
type XXX are immediately next to each other, the 
first word of the second entity will be tagged B-XXX 
in order to show that it starts another entity. The 
data contains entities of four types: persons (PER), 
organizations (ORG), locations (LOC) and miscel- 
laneous names (MISC). This tagging scheme is the 
IOB scheme originally put forward by Ramshaw and 
Marcus (1995). We assume that named entities are 
non-recursive and non-overlapping. When a named 
entity is embedded in another named entity, usually 
only the top level entity has been annotated. 
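As an illustration of the format described above, the following Python sketch reads sentences from data-file text and turns the named entity tags into entity spans. It is a minimal sketch, not the shared task software: the function names `read_sentences` and `extract_entities`, the `SAMPLE` string and the `(type, start, end)` span representation are assumptions made for this example. The tag is read from the last field of each line.

```python
from typing import List, Tuple

Entity = Tuple[str, int, int]  # (entity type, start index, end index exclusive)

# The example sentence from this section, one token per line.
SAMPLE = """\
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
"""


def read_sentences(text: str) -> List[List[List[str]]]:
    """Split data-file text into sentences; each token is its list of fields."""
    sentences, current = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:  # an empty line marks a sentence boundary
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(fields)
    if current:
        sentences.append(current)
    return sentences


def extract_entities(tags: List[str]) -> List[Entity]:
    """Convert a sentence's IOB tags (O, I-XXX, B-XXX) into entity spans."""
    entities: List[Entity] = []
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final entity
        prefix, _, ttype = tag.partition("-")
        # An open entity ends at O, at a B- tag, or when the type changes.
        if start is not None and (tag == "O" or prefix == "B" or ttype != etype):
            entities.append((etype, start, i))
            start, etype = None, None
        if tag != "O" and start is None:
            start, etype = i, ttype
    return entities


if __name__ == "__main__":
    sentence = read_sentences(SAMPLE)[0]
    print(extract_entities([token[-1] for token in sentence]))
```

The B-XXX case only matters when two entities of the same type are adjacent; in all other cases an entity boundary is visible from an O tag or a change of type.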

Table 2 contains an overview of the number of 
named entities in each data file. 

2.4 Evaluation 

The performance in this task is measured with the
$F_{\beta=1}$ rate:

\[
F_{\beta} = \frac{(\beta^2 + 1) \cdot \mathrm{precision} \cdot \mathrm{recall}}
                 {\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}
\tag{1}
\]

3 http://www.ldc.upenn.edu/
Table 3: Main features used by the sixteen systems that participated in the CoNLL-2003 shared task,
sorted by performance on the English test data. Aff: affix information (n-grams); bag: bag of words; cas:
global case information; chu: chunk tags; doc: global document information; gaz: gazetteers; lex: lexical
features; ort: orthographic information; pat: orthographic patterns (like AaO); pos: part-of-speech tags; pre:
previously predicted NE tags; quo: flag signalling that the word is between quotes; tri: trigger words.



with $\beta = 1$ (Van Rijsbergen, 1975). Precision is the
percentage of named entities found by the learning 
system that are correct. Recall is the percentage of 
named entities present in the corpus that are found 
by the system. A named entity is correct only if it 
is an exact match of the corresponding entity in the 
data file. 
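The evaluation measure can be sketched in a few lines of Python. This is an illustrative sketch rather than the official evaluation script: the function names and the representation of an entity as a `(sentence index, type, start, end)` tuple are assumptions. Exact matching is obtained by comparing whole tuples, so an entity with a wrong boundary or a wrong type counts as both a precision error and a recall error.

```python
from typing import Set, Tuple

# Hypothetical representation: (sentence index, entity type, start, end).
Entity = Tuple[int, str, int, int]


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F rate of equation (1); returns 0 when the denominator is 0."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / denominator


def evaluate(gold: Set[Entity], predicted: Set[Entity], beta: float = 1.0):
    """Return (precision, recall, F) as fractions, using exact-match counting."""
    correct = len(gold & predicted)  # same sentence, type and boundaries
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_beta(precision, recall, beta)


if __name__ == "__main__":
    gold = {(0, "ORG", 0, 1), (0, "PER", 2, 3), (0, "LOC", 5, 6)}
    predicted = {(0, "ORG", 0, 1), (0, "PER", 2, 3), (0, "LOC", 4, 6)}
    print(evaluate(gold, predicted))
    # The formula also applies directly to percentages, for example the
    # baseline precision and recall reported for the German test set.
    print(round(f_beta(31.86, 28.89), 2))
```

Applied to the baseline figures for the German test set in Table 5 (precision 31.86%, recall 28.89%), `f_beta` gives 30.30, which agrees with the rate listed there.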

3 Participating Systems 

Sixteen systems have participated in the CoNLL- 
2003 shared task. They employed a wide variety of 
machine learning techniques as well as system com- 
bination. Most of the participants have attempted 
to use information other than the available train- 
ing data. This information included gazetteers and 
unannotated data, and there was one participant 
who used the output of externally trained named en- 
tity recognition systems. 

3.1 Learning techniques 

The most frequently applied technique in the 
CoNLL-2003 shared task is the Maximum Entropy 
Model. Five systems used this statistical learning 
method. Three systems used Maximum Entropy 
Models in isolation (Bender et al., 2003; Chieu and 
Ng, 2003; Curran and Clark, 2003). Two more 
systems used them in combination with other tech- 
niques (Florian et al., 2003; Klein et al., 2003). Max- 
imum Entropy Models seem to be a good choice for 



this kind of task: the top three results for English 
and the top two results for German were obtained 
by participants who employed them in one way or 
another. 

Hidden Markov Models were employed by four of 
the systems that took part in the shared task (Flo- 
rian et al., 2003; Klein et al., 2003; Mayfield et al., 
2003; Whitelaw and Patrick, 2003). However, they 
were always used in combination with other learning 
techniques. Klein et al. (2003) also applied the re- 
lated Conditional Markov Models for combining clas- 
sifiers. 

Learning methods that were based on connection- 
ist approaches were applied by four systems. Zhang 
and Johnson (2003) used robust risk minimization, 
which is a Winnow technique. Florian et al. (2003) 
employed the same technique in a combination of 
learners. Voted perceptrons were applied to the
shared task data by Carreras et al. (2003a), and
Hammerton (2003) used a recurrent neural network
(Long Short-Term Memory) for finding named entities.

Other learning approaches were employed less fre- 
quently. Two teams used AdaBoost.MH (Carreras 
et al., 2003b; Wu et al., 2003) and two other groups 
employed memory-based learning (De Meulder and 
Daelemans, 2003; Hendrickx and Van den Bosch, 
2003). Transformation-based learning (Florian et 
al., 2003), Support Vector Machines (Mayfield et al., 
2003) and Conditional Random Fields (McCallum 



and Li, 2003) were applied by one system each. 

Combination of different learning systems has 
proven to be a good method for obtaining excellent 
results. Five participating groups have applied sys- 
tem combination. Florian et al. (2003) tested dif- 
ferent methods for combining the results of four sys- 
tems and found that robust risk minimization worked 
best. Klein et al. (2003) employed a stacked learn- 
ing system which contains Hidden Markov Models, 
Maximum Entropy Models and Conditional Markov 
Models. Mayfield et al. (2003) stacked two learners 
and obtained better performance. Wu et al. (2003) 
applied both stacking and voting to three learners. 
Munro et al. (2003) employed both voting and bag- 
ging for combining classifiers. 

3.2 Features 

The choice of the learning approach is important for
obtaining a good system for recognizing named
entities. However, in the CoNLL-2002 shared task we
found that the choice of features is at least as
important. An overview of some of the types of features
chosen by the shared task participants can be found
in Table 3.

All participants used lexical features (words) ex- 
cept for Whitelaw and Patrick (2003) who imple- 
mented a character-based method. Most of the sys- 
tems employed part-of-speech tags and two of them 
have recomputed the English tags with better tag- 
gers (Hendrickx and Van den Bosch, 2003; Wu et al., 
2003). Orthographic information, affixes, gazetteers
and chunk information were also incorporated in
most systems, although one group reports that the
available chunking information did not help (Wu et
al., 2003). Other features were used less frequently.
Table 3 does not reveal a single feature that would
be ideal for named entity recognition.

3.3 External resources 

Eleven of the sixteen participating teams have at- 
tempted to use information other than the training 
data that was supplied for this shared task. All in- 
cluded gazetteers in their systems. Four groups ex- 
amined the usability of unannotated data, either for 
extracting training instances (Bender et al., 2003; 
Hendrickx and Van den Bosch, 2003) or obtaining 
extra named entities for gazetteers (De Meulder and
Daelemans, 2003; McCallum and Li, 2003). Several
groups have also employed unannotated data for
obtaining capitalization features for
words. One participating team has used externally
trained named entity recognition systems for English 
as a part in a combined system (Florian et al., 2003). 
Table 4 shows the error reduction of the systems 



               G   U   E   English   German
Zhang          +   -   -     19%      15%
Florian        +   -   +     27%       5%
Hammerton      +   -   -     22%       -
Carreras (a)   +   -   -     12%       8%
Chieu          +   -   -     17%       -
Hendrickx      +   +   -      7%       5%
De Meulder     +   +   -      8%       3%
Bender         +   +   -      3%       6%
Curran         +   -   -      1%
McCallum       +   +   -      ?        ?
Wu             +   -   -

Table 4: Error reduction for the two development
data sets when using extra information like
gazetteers (G), unannotated data (U) or externally
developed named entity recognizers (E). The lines
have been sorted by the sum of the reduction
percentages for the two languages.

with extra information compared to using only
the available training data. The inclusion of extra
named entity recognition systems seems to have
worked well (Florian et al., 2003). Generally, the
systems that only used gazetteers seem to gain more
than systems that have used unannotated data for
purposes other than obtaining capitalization
information. However, the gain differences between the
two approaches are most obvious for English, for 
which better gazetteers are available. With the ex- 
ception of the result of Zhang and Johnson (2003), 
there is not much difference in the German results 
between the gains obtained by using gazetteers and 
those obtained by using unannotated data. 
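The text does not spell out how error reduction is computed. A common reading, assumed in the sketch below, treats the error as the distance of the $F_{\beta=1}$ rate from 100 and reports the relative decrease of that distance. The function name `error_reduction` is ours. The assumption is consistent with the German majority-vote figures in Section 3.4, where moving from 72.41 to 74.17 is reported as a 6% error reduction.

```python
def error_reduction(f_before: float, f_after: float) -> float:
    """Relative reduction, in percent, of the error (100 - F).

    Both arguments are F rates on a 0-100 scale. This definition is an
    assumption; the paper only reports the resulting percentages.
    """
    error_before = 100.0 - f_before
    error_after = 100.0 - f_after
    if error_before == 0:
        raise ValueError("no error left to reduce")
    return 100.0 * (error_before - error_after) / error_before


if __name__ == "__main__":
    # Best single system versus majority vote on the German test set.
    print(round(error_reduction(72.41, 74.17)))
```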

3.4 Performances 

A baseline rate was computed for the English and the 
German test sets. It was produced by a system which 
only identified entities which had a unique class in 
the training data. If a phrase was part of more than 
one entity, the system would select the longest one. 
All systems that participated in the shared task have 
outperformed the baseline system. 
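The baseline can be approximated with a simple lexicon lookup, sketched below in Python. The details are assumptions made for illustration: phrases are matched case-sensitively as token sequences, the sentence is scanned from left to right, and at each position the longest phrase with a unique class is chosen. The names `build_lexicon` and `baseline_tag` are hypothetical and the toy training entities are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple

Phrase = Tuple[str, ...]


def build_lexicon(training_entities: Iterable[Tuple[Sequence[str], str]]) -> Dict[Phrase, str]:
    """Keep only phrases that occur with exactly one class in the training data."""
    classes = defaultdict(set)
    for words, entity_type in training_entities:
        classes[tuple(words)].add(entity_type)
    return {phrase: next(iter(types))
            for phrase, types in classes.items() if len(types) == 1}


def baseline_tag(words: Sequence[str], lexicon: Dict[Phrase, str]) -> List[Tuple[str, int, int]]:
    """Return (type, start, end) spans, preferring the longest match at each position."""
    max_len = max((len(phrase) for phrase in lexicon), default=0)
    entities = []
    i = 0
    while i < len(words):
        match = None
        for n in range(min(max_len, len(words) - i), 0, -1):  # longest first
            phrase = tuple(words[i:i + n])
            if phrase in lexicon:
                match = (lexicon[phrase], i, i + n)
                break
        if match:
            entities.append(match)
            i = match[2]  # continue after the matched phrase
        else:
            i += 1
    return entities


if __name__ == "__main__":
    training = [
        (["New", "York"], "LOC"), (["York"], "LOC"), (["U.N."], "ORG"),
        (["Washington"], "PER"), (["Washington"], "LOC"),  # ambiguous, dropped
    ]
    lexicon = build_lexicon(training)
    print(baseline_tag("U.N. envoy leaves New York for Washington".split(), lexicon))
```

In this toy example Washington is left untagged because it occurs with two classes, and New York is preferred over the shorter match York.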

For all the $F_{\beta=1}$ rates we have estimated
significance boundaries by using bootstrap resampling
(Noreen, 1989). From each output file of a system,
250 random samples of sentences have been chosen
and the distribution of the $F_{\beta=1}$ rates in these
samples is assumed to be the distribution of the
performance of the system. We assume that performance
A is significantly different from performance B if A
is not within the center 90% of the distribution of B.
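The resampling procedure can be sketched as follows. This is a minimal Python sketch under stated assumptions: each sample draws as many sentences as the output file contains, with replacement, and a sentence is summarized by its counts of correct, predicted and gold entities. Neither detail is specified above, and the function names are ours.

```python
import random
from typing import List, Sequence, Tuple

Counts = Tuple[int, int, int]  # (correct, predicted, gold) entities in a sentence


def f1_from_counts(correct: int, predicted: int, gold: int) -> float:
    """F rate with beta = 1 on a 0-100 scale, computed from entity counts."""
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)


def bootstrap_interval(sentences: Sequence[Counts], n_samples: int = 250,
                       center: float = 0.90, seed: int = 0) -> Tuple[float, float]:
    """Bounds of the central `center` share of the resampled F distribution."""
    rng = random.Random(seed)
    scores: List[float] = []
    for _ in range(n_samples):
        sample = [rng.choice(sentences) for _ in sentences]  # with replacement
        scores.append(f1_from_counts(sum(c for c, _, _ in sample),
                                     sum(p for _, p, _ in sample),
                                     sum(g for _, _, g in sample)))
    scores.sort()
    tail = (1.0 - center) / 2.0
    low = scores[int(tail * n_samples)]
    high = scores[int((1.0 - tail) * n_samples) - 1]
    return low, high


def significantly_different(f_a: float, interval_b: Tuple[float, float]) -> bool:
    """A differs from B if A falls outside the central interval of B."""
    low, high = interval_b
    return not (low <= f_a <= high)


if __name__ == "__main__":
    toy = [(1, 1, 1)] * 50 + [(0, 1, 1)] * 50  # invented per-sentence counts
    interval = bootstrap_interval(toy)
    print(interval, significantly_different(90.0, interval))
```

The fixed seed only makes the sketch reproducible; it plays no role in the method itself.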

The performances of the sixteen systems on the 



two test data sets can be found in Table 5. For
English, the combined classifier of Florian et al. (2003)
achieved the highest overall $F_{\beta=1}$ rate. However,
the difference between their performance and that
of the Maximum Entropy approach of Chieu and Ng
(2003) is not significant. An important feature of the
best system that other participants did not use was
the inclusion of the output of two externally trained
named entity recognizers in the combination process.
Florian et al. (2003) have also obtained the highest
$F_{\beta=1}$ rate for the German data. Here there is no
significant difference between them and the systems of
Klein et al. (2003) and Zhang and Johnson (2003).

We have combined the results of the sixteen
systems in order to see if there was room for
improvement. We converted the output of the systems to
the same IOB tagging representation and searched
for the set of systems from which the best tags for
the development data could be obtained with
majority voting. The optimal set of systems was
determined by performing a bidirectional hill-climbing
search (Caruana and Freitag, 1994) with beam size 9,
starting from zero features. A majority vote of five
systems (Chieu and Ng, 2003; Florian et al., 2003;
Klein et al., 2003; McCallum and Li, 2003; Whitelaw
and Patrick, 2003) performed best on the English
development data. Another combination of five
systems (Carreras et al., 2003b; Mayfield et al., 2003;
McCallum and Li, 2003; Munro et al., 2003; Zhang
and Johnson, 2003) obtained the best result for the
German development data. We have performed a
majority vote with these sets of systems on the
related test sets and obtained $F_{\beta=1}$ rates of 90.30 for
English (14% error reduction compared with the best
system) and 74.17 for German (6% error reduction).
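Token-level majority voting over system outputs can be sketched as below. This is an illustration in Python, not the code used for the experiment: the function name `majority_vote` is hypothetical, all systems are assumed to have tagged the same tokens in the same IOB representation, and ties are resolved in favour of the system listed first, a choice the text leaves open. The search for the best subset of systems is not shown.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(system_tags: Sequence[Sequence[str]]) -> List[str]:
    """Pick, for every token, the tag proposed by most systems.

    `system_tags` holds one tag sequence per system. On a tie the tag
    that was seen first (from the earliest listed system) wins, because
    Counter.most_common keeps first-encountered order for equal counts.
    """
    if not system_tags:
        return []
    n_tokens = len(system_tags[0])
    if any(len(tags) != n_tokens for tags in system_tags):
        raise ValueError("all systems must tag the same number of tokens")
    voted = []
    for proposals in zip(*system_tags):  # tags of all systems for one token
        voted.append(Counter(proposals).most_common(1)[0][0])
    return voted


if __name__ == "__main__":
    outputs = [
        ["I-ORG", "O", "I-PER", "O"],
        ["I-ORG", "O", "O", "O"],
        ["I-LOC", "O", "I-PER", "O"],
    ]
    print(majority_vote(outputs))
```

Voting on individual tags can produce sequences that no single system proposed, so the combined output should be converted back to entities before it is scored.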

4 Concluding Remarks 

We have described the CoNLL-2003 shared task: 
language-independent named entity recognition. 
Sixteen systems have processed English and German 
named entity data. The best performance for both 
languages has been obtained by a combined learn- 
ing system that used Maximum Entropy Models, 
transformation-based learning, Hidden Markov Mod- 
els as well as robust risk minimization (Florian et al., 
2003). Apart from the training data, this system also 
employed gazetteers and the output of two externally 
trained named entity recognizers. The performance 
of the system of Chieu and Ng (2003) was not
significantly different from the best performance for
English, and the method of Klein et al. (2003) and the
approach of Zhang and Johnson (2003) were not
significantly worse than the best result for German.
Eleven teams have incorporated information other 



than the training data in their system. Four of them 
have obtained error reductions of 15% or more for 
English and one has managed this for German. The 
resources used by these systems, gazetteers and ex- 
ternally trained named entity systems, still require a 
lot of manual work. Systems that employed
unannotated data obtained performance gains of around 5%.
The search for an excellent method for taking
advantage of the vast amount of available raw text remains
open.
English test    Precision    Recall    $F_{\beta=1}$
...
...             ...          ...       76.97±1.2
...
...             69.09%       53.26%    60.15±1.3
Baseline        71.91%       50.90%    59.61±1.2

German test     Precision    Recall    $F_{\beta=1}$
Florian         83.87%       63.71%    72.41±1.3
Klein           80.38%       65.04%    ...
Zhang           ...          63.03%    71.27±1.5
...
...             ...          59.35%    66.34±1.3
...
Hendrickx       71.15%       56.55%    63.02±1.4
De Meulder      63.93%       51.86%    57.27±1.6
Whitelaw        71.05%       44.11%    54.43±1.4
Hammerton       63.49%       38.25%    47.74±1.5
Baseline        31.86%       28.89%    30.30±1.3

Table 5: Overall precision, recall and $F_{\beta=1}$ rates
obtained by the sixteen participating systems on the
test data sets for the two languages in the
CoNLL-2003 shared task.